Abstract: The turbot (Scophthalmus maximus) is a commercially valuable flatfish and one 
of the most promising aquaculture species in Europe. Two transcriptome 454-pyrosequencing 
runs were used to detect single nucleotide polymorphisms (SNPs) in genes 
related to immune response and gonad differentiation. A total of 866 true SNPs were 
detected in 140 different contigs representing 262,093 bp as a whole. Only one true SNP 
was analyzed in each contig. One hundred and thirteen SNPs out of the 140 analyzed 
were feasible (genotyped), and 111 of them were polymorphic in a wild population. 
The transition/transversion ratio (1.354) was similar to that observed in other fish studies. 
Unbiased gene diversity (He) estimates ranged from 0.060 to 0.510 (mean = 0.351), 
minor allele frequency (MAF) from 0.030 to 0.500 (mean = 0.259) and all loci were in 
Hardy-Weinberg equilibrium after Bonferroni correction. A large number of SNPs (49) 
were located in the coding region, 33 representing synonymous and 16 non-synonymous 
changes. Most SNP-containing genes were related to immune response and gonad 
differentiation processes, and could be candidates for functional changes leading to 
phenotypic changes. These markers will be useful for population screening to look for 
adaptive variation in wild and domestic turbot. 

Keywords: turbot; Scophthalmus maximus; SNP validation; EST database; non-synonymous 
substitution; high-throughput genotyping 



1. Introduction 

The turbot (Scophthalmus maximus; Scophthalmidae, Pleuronectiformes) is a commercially 
valuable flatfish that has been intensively cultured since the 1980s. Its production has steadily 
increased up to the present figure of 8549 tons in 2011 (91.2% European production from Spain; [1]) 
and it appears to be one of the most promising aquaculture species in Europe. In response to turbot 
industry demands, genetic markers have been developed in this species in order to evaluate genetic 
resources in both wild and hatchery populations and perform parentage analysis to support genetic 
breeding programs [2-4]. These markers have also been applied to develop genomic tools to identify 
genomic regions associated with productive characters [5-7] and to detect selection footprints in wild 
populations [8]. Increasing growth rate, controlling sex ratio (females largely outgrow males) and 
enhancing disease resistance currently constitute the main goals of genetic breeding programs in 
this species. 

The need to understand the immune response to pathogens of industrial relevance and to 
identify genes involved in the sex differentiation pathway led us to increase genomic resources in 
turbot. Consequently, an Expressed Sequence Tag (EST) database from cDNA libraries of 
the main immune tissues was constructed using Sanger sequencing [9]. Recently, this database has 
been expanded with two 454 FLX runs [10,11] (454-Life Sciences, Branford, CT, USA; for 
454-technique methodology see [12,13]). Next Generation Sequencing (NGS) technologies offer the 
ability to produce an enormous volume of data with a very low sequencing cost per base [12]. Thus, 
this turbot EST database is currently composed of ~70,000 unique sequences (~20,000 contigs and 
~50,000 singletons). ESTs are essential to ascertain the gene [14,15], but also to identify polymorphic 
gene-associated markers, such as microsatellites and single nucleotide polymorphisms (SNPs) (type I 
markers; [9,16-18]). Type I markers are very useful for constructing genetic or physical maps, and for 
comparative mapping [7,19,20]. 

SNPs have several advantages over other markers when it comes to mapping genes or inferring 
population structure [21]. They can be easily evaluated in silico from public databases and their genotypes 
quickly assessed by mini-sequencing reactions [9,22] or by high-throughput technologies [23,24]. SNP 
alleles are almost exclusively identical-by-descent (IBD) and thus they prevent scoring errors 
associated with homoplasy [25]. They are extremely stable, due to low mutation rates [26], and occur 
more often in the genome than other markers. In the human genome, for instance, there is on average 
1 SNP per 300 bp [27], and their frequency in non-model species has been estimated at ~1 in 200-500 
bases for non-coding DNA and ~1 in 500-1000 bases for coding DNA [28]. In turbot, Vera et al. [29] 
estimated 1 true SNP every ~100 bp from the EST database composed only of Sanger sequences, 
suggesting the existence of large SNP resources in this species. During the last decade, SNP discovery 
pipelines have been developed for aquaculture species including fish [18,30-35], shellfish [36-38] and 
crustaceans [39,40]. In turbot, a SNP calling tool was included in the turbot database [9] and it has 
been refined in the updated version [11]. In this study, we screened genomic resources available in an 
updated version of the turbot EST database using contigs containing NGS 454-sequences to identify 
and characterize SNPs associated with immune- and reproduction-related genes. These markers will be 
used for further structural genomic analysis focused on quantitative trait loci (QTLs) linked to productive 
traits, as well as for population screening to look for adaptive variation in wild and domestic turbot. 

2. Results and Discussion 

2.1. Database Exploitation and SNP Detection 

The main characteristics of the turbot 454-transcriptome sequencing runs have been described in 
previous studies [10,11]. The database used (version 4.0, September 2011) comprised 
71,033 unique sequences (18,880 contigs and 52,153 singletons), including 454-sequences and Sanger 
sequences [9], with a total length of 52,402,177 base pairs (bp, ~52 Mb). However, in order to avoid 
duplicates with the previous SNPs developed from sequences obtained with Sanger methodology [29], 
and since we were mainly interested in validating SNPs at new immune- and reproduction-related 
genes, only contigs composed exclusively of at least four 454-sequences were used for SNP detection. 
Thus, 140 contigs from the turbot database that met these requirements were taken into account for 
SNP development. The total length analyzed was 262,093 bp and contig length ranged from 728 bp 
to 4885 bp, with a mean length of 1872.09 ± 746.69 bp. The total number of true SNPs detected 
using the program QualitySNP (for the true SNP definition see the experimental section) was 866, and 
the SNP number per contig ranged from 1 to 58, with a mean value of 6.18 ± 8.34. Thus, the expected 
frequency of SNPs in the analyzed sequences would be 1 SNP every 302 bp. This value is 
lower than that previously reported in S. maximus (1 SNP every ~100 bp; [29]), but similar to those 
described in non-model species [28]. The success of any genotyping method is reflected in what is 
referred to as the conversion rate and the global success rate. The former only considers the 
polymorphic markers, whereas the latter considers all the markers (monomorphic and polymorphic) 
that were successfully typed within the analyzed samples [41]. Of the 140 true SNPs tested, 
27 (19.3%) could not be genotyped, and thus they were considered to be genotyping failures due to 
technical and/or genotyping problems. Only 2 of the 113 feasible SNPs (see definition in the 
experimental section) were monomorphic. Therefore, the global success rate and conversion rate were 
80.7% and 79.3%, respectively. The global success rate was very similar to that previously described in 
the species (78.4%), but the conversion rate was much higher than previously reported using sequences 
from cDNA libraries (37.7%; see Vera et al. [29]), likely due to the different library construction methods 
and bioinformatic pipeline approaches followed in 454 and Sanger contigs (see experimental section). 
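
The SNP frequency and the two success rates quoted above follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative only and are not part of the original analysis.

```python
# Counts quoted in Section 2.1 of the text.
TOTAL_BP = 262_093      # total contig length analyzed (bp)
TRUE_SNPS = 866         # true SNPs detected with QualitySNP
TESTED = 140            # one true SNP tested per contig
FAILED = 27             # SNPs that could not be genotyped
MONOMORPHIC = 2         # feasible SNPs showing a single allele


def snp_spacing(total_bp: int, n_snps: int) -> float:
    """Average number of base pairs per SNP."""
    return total_bp / n_snps


def genotyping_rates(tested: int, failed: int, monomorphic: int) -> dict:
    """Global success rate (all typed markers) and conversion rate
    (polymorphic markers only), both relative to the SNPs tested."""
    feasible = tested - failed
    polymorphic = feasible - monomorphic
    return {
        "feasible": feasible,
        "polymorphic": polymorphic,
        "global_success_rate": feasible / tested,
        "conversion_rate": polymorphic / tested,
    }


rates = genotyping_rates(TESTED, FAILED, MONOMORPHIC)
print(f"1 SNP every {snp_spacing(TOTAL_BP, TRUE_SNPS):.1f} bp")
print(f"global success rate: {rates['global_success_rate']:.1%}")
print(f"conversion rate: {rates['conversion_rate']:.1%}")
```

Both rates use the 140 tested SNPs as the denominator, which is what yields the 80.7% and 79.3% figures given in the text.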

2.2. SNP Performance 

A total of 65 transitions (A/G and C/T) and 48 transversions (A/C, A/T, C/G and G/T) were 
detected among feasible SNPs, A/G being the most common (35) and A/C the least common (6) 
substitutions observed (Figure 1). This represented a transition/transversion (ts/tv) ratio of 1.354. This 
ratio was lower than that observed by Vera et al. [29] (1.885) and in silico (1.456) by Pardo et al. [9], but it 
was very similar to that described for common carp (Cyprinus carpio) (1.310) [42] and gilthead 
seabream (Sparus aurata) (1.375) [31]. Also, the most frequent transitions and transversions differed 
from previous reports: C/T and G/T, respectively [29], and A/G and A/C [9]. These discrepancies 
could be due to the different sequencing directions, as all sequences by Vera et al. [29] and Pardo et al. [9] 
were obtained from the 3' end using cDNA libraries, while those from the 454-run were randomly 
obtained by fragmentation of the whole cDNA according to the cDNA rapid library preparation 
method (Roche Farma, S.A. [43]). Moreover, the larger proportion of coding sequence analyzed in the 
454 runs compared with Sanger sequencing may contribute to these differences, because UTRs and 
coding regions are under different selective constraints. No differences were detected in the distribution 
of variants between tested SNPs and feasible SNPs (χ² = 0.3115; p = 0.9974). All polymorphic SNP loci 
showed two alleles and all of them agreed with those expected from the database information. 
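
As an illustration of how the ts/tv ratio is obtained, the following Python sketch classifies biallelic variants written in the "A/G" notation of Table 1 and computes the ratio. It is a minimal example under that notation; the helper names and the toy list are hypothetical and do not come from the original pipeline.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(variant: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes
    (A/G, C/T); False for transversions (A/C, A/T, C/G, G/T)."""
    alleles = frozenset(variant.upper().split("/"))
    if len(alleles) != 2 or not alleles <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a biallelic SNP: {variant!r}")
    return alleles <= PURINES or alleles <= PYRIMIDINES


def ts_tv_ratio(variants: list) -> float:
    """Number of transitions divided by number of transversions."""
    ts = sum(is_transition(v) for v in variants)
    tv = len(variants) - ts
    return ts / tv


# Toy list (hypothetical), just to show the classification.
toy = ["A/G", "C/T", "A/C", "G/T", "A/G"]
print(ts_tv_ratio(toy))

# Totals reported in the text for the 113 feasible SNPs.
print(round(65 / 48, 3))
```

Applied to the Variants column of Table 1, the same function would be expected to recover the 65 transitions and 48 transversions reported above.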

Figure 1. Distribution of SNP variants analyzed in this study (a) using all SNPs tested; 
(b) using only feasible SNPs. Transitions (ts) and transversions (tv) are indicated in black 
and grey colour, respectively. 

2.3. SNP Diversity 

Only two loci among the 113 feasible SNPs were monomorphic (SmaSNP_287 and SmaSNP_334). 
Among polymorphic SNPs, unbiased gene diversity (He) estimates ranged from 0.060 at 
SmaSNP_237, SmaSNP_245 and SmaSNP_305 to 0.510 at SmaSNP_225 with a mean value of 
0.344 ± 0.149. The minor allele frequency (MAF) in the polymorphic markers ranged from 0.030 
(SmaSNP_237, SmaSNP_245 and SmaSNP_305) to 0.500 in SmaSNP_249 with a mean value of 
0.259 ± 0.140. Departures from Hardy-Weinberg equilibrium (HWE) were detected in five markers 
(SmaSNP_253, SmaSNP_271, SmaSNP_279, SmaSNP_289, SmaSNP_326; Table 1), although all 
markers were at equilibrium after Bonferroni correction (p = 0.0004). The samples from the 
Cantabrian turbot population were globally in accordance with HWE expectations when tested 
simultaneously for all loci (p = 0.9999). These polymorphism values were within the range of those 
previously described in the species [29], and they were also similar to those reported in other fish 
species [42,44]. No linkage disequilibrium (LD) was detected among the 6328 loci pairs after 
Bonferroni correction (p = 0.0004). 
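
The diversity statistics above can be illustrated with a small Python sketch. It assumes Nei's (1978) small-sample estimator of gene diversity, He = 2n/(2n - 1) × (1 - Σp²); this section does not state which estimator or software produced the values in Table 1, so the sketch is not expected to reproduce them exactly. The genotype counts used are hypothetical.

```python
from math import comb


def allele_freqs(n_aa: int, n_ab: int, n_bb: int) -> tuple:
    """Allele frequencies (p, q) at a biallelic locus from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return p, 1.0 - p


def minor_allele_freq(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of the less common allele (MAF)."""
    return min(allele_freqs(n_aa, n_ab, n_bb))


def unbiased_he(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Gene diversity with the 2n/(2n - 1) small-sample correction
    (assumed estimator, see the note above)."""
    n = n_aa + n_ab + n_bb
    p, q = allele_freqs(n_aa, n_ab, n_bb)
    return (2 * n / (2 * n - 1)) * (1.0 - p * p - q * q)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical genotype counts for one locus typed in 33 individuals.
counts = (15, 14, 4)
print(round(minor_allele_freq(*counts), 3))
print(round(unbiased_he(*counts), 3))

# 113 loci give comb(113, 2) pairwise LD tests; 0.05 / 113 rounds to 0.0004.
print(comb(113, 2))
print(round(bonferroni_alpha(0.05, 113), 4))
```

With 113 loci, the number of locus pairs is 113 × 112 / 2 = 6328, matching the figure quoted for the LD tests, and 0.05 / 113 rounds to the reported threshold of 0.0004.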



Table 1. Annotation, variants and diversity values of the 113 technically feasible SNPs in 
the Cantabrian turbot population (33 individuals) used in this study. 

SNP Name | Annotation | Variants | MAF | P (HW) | He | Fis
SmaSNP_211 | [illegible] | A/T | [illegible] | [illegible] | 0.262 | [illegible]
SmaSNP_212 | Zona pellucida sperm-binding protein 3 | A/G | A = 0.152 | 0.4198 | 0.265 | 0.179
SmaSNP_215 | Mitotic specific cyclin-B1 | C/T | T = 0.212 | 0.2948 | 0.338 | -0.255
SmaSNP_216 | Pre-mRNA branch site protein p14 | A/T | T = 0.348 | 0.7003 | 0.460 | -0.119
SmaSNP_217 | Zona pellucida protein C1 | A/G | G = 0.258 | 0.1616 | 0.390 | 0.301
SmaSNP_218 | Mitochondrial ribosomal protein S18A | G/T | T = 0.333 | 1.0000 | 0.452 | 0.061
SmaSNP_219 | U3 small nucleolar ribonucleoprotein protein IMP3 | G/T | G = 0.409 | 1.0000 | 0.491 | -0.050
SmaSNP_220 | Coatomer subunit epsilon isoform 1 | C/T | T = 0.197 | 0.5750 | 0.322 | 0.153
SmaSNP_222 | Signal recognition particle 14 kDa protein | G/T | G = 0.203 | 1.0000 | 0.329 | -0.046
SmaSNP_223 | Epithelial cell adhesion protein | A/T | T = 0.333 | 1.0000 | 0.452 | 0.061
SmaSNP_224 | Transcription initiation factor TFIID subunit D11 | C/G | G = 0.182 | 1.0000 | 0.302 | -0.003
SmaSNP_225 | Acidic ribosomal protein P1 | A/G | G = 0.480 | 1.0000 | 0.510 | 0.059
SmaSNP_226 | Alcohol dehydrogenase Class-3 | C/T | T = 0.288 | 0.6913 | 0.416 | -0.093
SmaSNP_227 | Thioredoxin protein 4A | A/G | A = 0.242 | 1.0000 | 0.373 | 0.025
SmaSNP_228 | Novel protein similar to vertebrate THAP domain containing 4 (THAP4) | A/G | G = 0.212 | 0.6068 | 0.340 | 0.109
SmaSNP_229 | Tumor suppressor candidate 2 | A/G | A = 0.031 | 1.0000 | 0.061 | -0.016
SmaSNP_230 | Optic atrophy 3 protein | C/T | T = 0.266 | 0.6477 | 0.397 | 0.135
SmaSNP_231 | RNA 3'-terminal phosphate cyclase | A/C | C = 0.266 | 0.6475 | 0.397 | 0.135
SmaSNP_232 | RAD1 homolog | A/G | A = 0.438 | 0.4921 | 0.501 | 0.127
SmaSNP_233 | Ubiquitin carrier protein | G/T | T = 0.409 | 1.0000 | 0.491 | -0.050
SmaSNP_234 | Chromatin accessibility complex protein 1 | A/G | G = 0.047 | 0.0504 | 0.092 | 0.659
SmaSNP_235 | Nucleolar protein 16 | A/G | G = 0.258 | 0.4023 | 0.389 | 0.144
SmaSNP_236 | Isopentenyl-diphosphate delta-isomerase 1 | C/G | G = 0.141 | 0.4763 | 0.246 | 0.111
SmaSNP_237 | Ran-specific GTPase-activating protein | G/T | T = 0.030 | 1.0000 | 0.060 | -0.016
SmaSNP_238 | Forkhead box H1 | A/G | A = 0.453 | 1.0000 | 0.503 | -0.056
SmaSNP_239 | Stathmin | C/T | C = 0.258 | 0.6436 | 0.387 | -0.174
SmaSNP_240 | Ubiquinol-cytochrome c reductase core I protein | C/T | C = 0.152 | 0.5521 | 0.261 | 0.072
SmaSNP_241 | BolA-like protein 3 | C/G | G = 0.313 | 0.4371 | 0.438 | 0.143
SmaSNP_243 | Ceroid-lipofuscinosis neuronal protein 5 | G/T | G = 0.455 | 1.0000 | 0.504 | 0.038
SmaSNP_244 | SSU rRNA; Psetta maxima (turbot) | C/T | C = 0.061 | 1.0000 | 0.116 | -0.049
SmaSNP_245 | Chromobox protein homolog 3 | G/T | T = 0.030 | 1.0000 | 0.060 | -0.016
SmaSNP_246 | Transmembrane protein 208 | A/C | A = 0.469 | 0.7198 | 0.505 | -0.114
SmaSNP_247 | Ribosomal protein L18a | A/C | A = 0.234 | 1.0000 | 0.365 | 0.058
SmaSNP_248 | Pre-mRNA-processing factor 19 | C/T | C = 0.318 | 1.0000 | 0.440 | -0.032
SmaSNP_249 | Alpha-L-fucosidase | A/G | A = 0.500 | 0.7275 | 0.509 | 0.106
SmaSNP_250 | Protein phosphatase 2 (formerly 2A) | A/G | G = 0.406 | 0.0598 | 0.493 | 0.366
SmaSNP_252 | LON peptidase N-terminal domain and RING finger protein 1 | G/T | T = 0.167 | 1.0000 | 0.282 | 0.034
SmaSNP_253 | IK cytokine | A/G | A = 0.439 | 0.0348 | 0.497 | -0.402
SmaSNP_256 | Ribonuclease UK114 | C/T | C = 0.232 | 0.6038 | 0.364 | 0.116
SmaSNP_257 | Inner centromere protein | A/G | G = 0.303 | 0.4239 | 0.430 | 0.154
SmaSNP_259 | Beta-galactoside-binding lectin | C/T | C = 0.379 | 0.7242 | 0.477 | -0.079
SmaSNP_260 | Enoyl-Coenzyme A hydratase | A/T | A = 0.273 | 0.3819 | 0.402 | -0.208
SmaSNP_261 | Sept2 protein | A/G | G = 0.197 | 1.0000 | 0.321 | -0.038
SmaSNP_262 | DNA-directed RNA polymerase I subunit RPA34 | A/G | A = 0.078 | 1.0000 | 0.146 | -0.069
SmaSNP_263 | Epithelial membrane protein 2 | A/G | G = 0.379 | 0.1336 | 0.480 | 0.306
SmaSNP_264 | Retinol dehydrogenase 3 | C/G | G = 0.409 | 1.0000 | 0.491 | -0.050
SmaSNP_265 | WD repeat-containing protein 54 | A/G | A = 0.076 | 1.0000 | 0.142 | -0.067
SmaSNP_266 | tRNA pseudouridine synthase 3 | C/T | C = 0.136 | 0.4637 | 0.240 | 0.115
SmaSNP_267 | Transmembrane protein 167 precursor | G/T | G = 0.258 | 0.6463 | 0.387 | -0.174
SmaSNP_270 | Flotillin-1 | C/G | G = 0.438 | 0.1694 | 0.502 | 0.253
SmaSNP_271 | NAD(P)H dehydrogenase quinone 1 | A/G | G = 0.359 | 0.0488 | 0.471 | 0.403
SmaSNP_273 | Ubiquitin protein ligase E3 component | C/T | C = 0.484 | 0.7353 | 0.508 | 0.077
SmaSNP_274 | K13213 matrin 3 | C/G | G = 0.106 | 0.2983 | 0.193 | 0.216
SmaSNP_275 | Dolichol-phosphate mannosyltransferase | A/G | A = 0.091 | 0.2209 | 0.169 | 0.281
SmaSNP_276 | DNA-directed RNA polymerases I, II and III subunit RPABC1 | A/G | A = 0.197 | 0.5728 | 0.322 | 0.153
SmaSNP_277 | Syndecan 2 | A/C | A = 0.429 | 1.0000 | 0.505 | -0.130
SmaSNP_278 | Peptide methionine sulfoxide reductase | C/G | C = 0.078 | 1.0000 | 0.146 | -0.069
SmaSNP_279 | Methyltransferase-like protein 21D | G/T | G = 0.470 | 0.0129 | 0.509 | 0.465
SmaSNP_281 | Phosphatidylinositol transfer protein beta isoform-like isoform 2 | C/T | T = 0.318 | 0.4333 | 0.439 | -0.172
SmaSNP_282 | Zona pellucida protein C | A/T | T = 0.121 | 1.0000 | 0.216 | -0.123
SmaSNP_283 | AP-2 complex subunit alpha-2-like | G/T | G = 0.333 | 0.2669 | 0.450 | -0.213
SmaSNP_284 | Apoptosis regulator BAX | A/G | G = 0.409 | 0.0780 | 0.493 | 0.324
SmaSNP_285 | Borealin | G/T | T = 0.032 | 1.0000 | 0.063 | -0.017
SmaSNP_286 | Brain protein 44 | C/T | C = 0.394 | 0.2691 | 0.483 | -0.255
SmaSNP_287 | Exosome component 8 | A/G | G = 1.000 |  | 0.000 | NA
SmaSNP_288 | Atrophin-1 domain containing protein | G/T | T = 0.439 | 1.0000 | 0.500 | -0.030
SmaSNP_289 | Similar to connectin/titin | A/T | A = 0.303 | 0.0018 | 0.433 | 0.580
SmaSNP_290 | Ubiquitin carboxyl-terminal hydrolase L5 | C/T | T = 0.469 | 1.0000 | 0.506 | 0.012
SmaSNP_292 | Histone deacetylase complex subunit SAP18 | C/T | C = 0.188 | 0.5568 | 0.308 | -0.216
SmaSNP_293 | Replication protein A 14 kDa subunit | C/G | G = 0.182 | 1.0000 | 0.302 | -0.003
SmaSNP_296 | Carbonic anhydrase | G/T | G = 0.076 | 1.0000 | 0.142 | -0.067
SmaSNP_297 | UPF0414 transmembrane protein | C/T | C = 0.212 | 0.6080 | 0.340 | 0.109
SmaSNP_298 | Queuine tRNA-ribosyltransferase | C/T | T = 0.061 | 1.0000 | 0.116 | -0.049
SmaSNP_299 | NHP2-like protein 1 | C/G | C = 0.379 | 0.1358 | 0.480 | 0.306
SmaSNP_304 | Microsomal glutathione S-transferase 3 | A/G | A = 0.091 | 1.0000 | 0.168 | -0.085
SmaSNP_305 | Actin related protein 2/3 complex subunit 4 | C/T | T = 0.030 | 1.0000 | 0.060 | -0.016
SmaSNP_306 | Cyclophilin B | C/G | C = 0.061 | 1.0000 | 0.116 | -0.049
SmaSNP_307 | Dynein light chain Tctex-type 3 | C/G | C = 0.061 | 1.0000 | 0.116 | -0.049
SmaSNP_308 | Ependymin-1 | A/G | A = 0.234 | 0.3135 | 0.366 | 0.231
SmaSNP_309 | C-4 methylsterol oxidase | A/G | A = 0.297 | 1.0000 | 0.424 | 0.043
SmaSNP_310 | Dynein light chain LC8-type | G/T | T = 0.045 | 1.0000 | 0.088 | -0.032
SmaSNP_311 | Rho-related GTP-binding protein RhoF | A/T | T = 0.394 | 0.2669 | 0.483 | -0.255
SmaSNP_312 | Golgi SNAP receptor complex member 1 | A/T | A = 0.188 | 0.5587 | 0.308 | -0.216
SmaSNP_314 | Ribosomal L1 domain-containing protein 1 | A/G | A = 0.203 | 1.0000 | 0.329 | -0.046
SmaSNP_315 | N-alpha-acetyltransferase 50 | A/T | A = 0.242 | 1.0000 | 0.373 | 0.025
SmaSNP_316 | Oncogene DJ-1 isoform 1 | C/T | C = 0.453 | 1.0000 | 0.503 | -0.056
SmaSNP_317 | Wu:fj40d12 protein n=1 Tax=Euteleostomi RepID=A3KP21_DANRE | A/G | A = 0.438 | 1.0000 | 0.500 | 0.000
SmaSNP_318 | Mucin multi-domain protein | C/G | C = 0.167 | 0.5617 | 0.281 | -0.185
SmaSNP_319 | Adenosine kinase | A/G | A = 0.182 | 0.5575 | 0.301 | -0.208
SmaSNP_320 | No homology found | A/G | A = 0.394 | 0.4901 | 0.486 | 0.127
SmaSNP_321 | Zymogen granule membrane protein 16 | A/G | G = 0.333 | 1.0000 | 0.451 | -0.076
SmaSNP_322 | 6-Pyruvoyl tetrahydrobiopterin synthase | C/T | C = 0.031 | 1.0000 | 0.061 | -0.016
SmaSNP_323 | Proteasome subunit beta | C/T | T = 0.125 | 1.0000 | 0.222 | -0.127
SmaSNP_324 | RING finger protein 4 | A/G | A = 0.394 | 0.0652 | 0.488 | 0.379
SmaSNP_325 | Lipocalin | C/G | C = 0.136 | 1.0000 | 0.239 | -0.143
SmaSNP_326 | Choline transporter-like protein 2 | A/G | G = 0.455 | 0.0311 | 0.507 | 0.402
SmaSNP_328 | RNA-binding proteins (RRM domain) | C/T | C = 0.106 | 1.0000 | 0.192 | -0.103
SmaSNP_329 | Type II keratin | C/G | G = 0.061 | 1.0000 | 0.116 | -0.049
SmaSNP_330 | Novel protein similar to vertebrate thyroid hormone receptor interactor 12 (TRIP12) | C/T | T = 0.094 | 1.0000 | 0.172 | -0.088
SmaSNP_332 | Ribosomal protein S6 kinase | A/C | A = 0.470 | 0.7287 | 0.507 | 0.103
SmaSNP_333 | Transmembrane 6 superfamily member 2 | A/T | T = 0.288 | 0.0796 | 0.419 | 0.348
SmaSNP_334 | PREDICTED: hypothetical protein LOC100712283 [Oreochromis niloticus] | C/T | T = 1.000 |  | 0.000 | NA
SmaSNP_337 | 1-Alkyl-2-acetylglycerophosphocholine esterase | C/T | C = 0.234 | 1.0000 | 0.365 | 0.058
SmaSNP_338 | CD151 antigen | C/T | T = 0.266 | 0.3909 | 0.395 | -0.186
SmaSNP_339 | Arsenite methyltransferase 1 | A/T | A = 0.313 | 1.0000 | 0.436 | -0.002
SmaSNP_340 | Receptor expression-enhancing protein 5 | C/T | T = 0.234 | 0.6507 | 0.364 | -0.116
SmaSNP_341 | Cathepsin S | C/G | G = 0.333 | 0.1119 | 0.454 | 0.332
SmaSNP_342 | Trans-1,2-dihydrobenzene-1,2-diol dehydrogenase | A/G | A = 0.424 | 0.2818 | 0.494 | -0.226
SmaSNP_343 | High mobility group protein 2 | G/T | G = 0.470 | 0.2980 | 0.508 | 0.224
SmaSNP_346 | ATP-binding cassette, sub-family A (ABC1) | C/T | T = 0.288 | 1.0000 | 0.417 | 0.055
SmaSNP_347 | Myomesin 1a (skelemin) | C/T | T = 0.091 | 1.0000 | 0.168 | -0.085
SmaSNP_348 | Retinoic acid receptor responder protein 3 | A/G | G = 0.439 | 1.0000 | 0.500 | -0.030
SmaSNP_349 | Nucleophosmin 1 | A/C | A = 0.258 | 0.6466 | 0.387 | -0.174

2.4. SNP Position within Genes: Synonymous vs. Non-Synonymous Substitutions 

Consensus sequences of contigs containing polymorphic SNPs were compared using NCBI BLAST 
with public databases, namely UniRef90, NCBI's nr, KEGG, COG, PFAM, LSU and SSU. The 
subsequent BLAST output was then parsed with AutoFACT [45]. All contigs containing feasible 
SNPs were annotated (except SmaSNP_320, Table 1). The informative strand, reading frame, and stop 
codon at each contig were recorded using homology with the most similar annotated sequence 
in public databases. Nine feasible SNPs (8.0%) could not be positioned, because no consistent reading 
frames were detected (indicated as "unknown" location in Table 2). Fifty-five SNPs (48.7%) were 
located in untranslated regions (UTR), either in the 5' UTR (17, 15.0%) or 3' UTR (38, 33.6%), which 
is consistent with the 3' UTR being approximately twice as long as the 5' UTR [9]. On the other 
hand, 49 SNPs (43.4%) were localized in the coding region (Table 2), a higher percentage 
than previously reported in the species (24.7%, [29]) and in other aquaculture fish species (e.g., 
Atlantic salmon 24%, [32]; Atlantic cod 17.4%, [34]). All these studies followed a 3' UTR Sanger 
sequencing strategy, and therefore the coding region was less represented than in the case of the 454 
Roche runs after a cDNA rapid library preparation protocol, which accounts for the differences 
observed. This result shows the utility of the NGS methodologies for SNP detection in the coding 
region. Thirty-three (29.2%) of these 49 SNPs were synonymous, and 16 (14.2%) were 
non-synonymous. The ratio of synonymous to non-synonymous 
changes (2:1) was lower than in other species [46,47]. Evolutionary constraints should preferentially 
eliminate non-synonymous variation because it is usually associated with deleterious mutations [35]. 
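
The percentages in the paragraph above are all expressed relative to the 113 feasible SNPs. The following Python sketch recomputes them from the reported counts; the dictionary keys and the function name are illustrative only.

```python
# SNP counts per location/effect category, as quoted in the text.
LOCATION_COUNTS = {
    "5' UTR": 17,
    "3' UTR": 38,
    "synonymous": 33,
    "non-synonymous": 16,
    "unknown": 9,
}


def percentages(counts: dict) -> dict:
    """Share of each category as a percentage of all SNPs, one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = percentages(LOCATION_COUNTS)
coding = LOCATION_COUNTS["synonymous"] + LOCATION_COUNTS["non-synonymous"]
syn_to_nonsyn = LOCATION_COUNTS["synonymous"] / LOCATION_COUNTS["non-synonymous"]

print(shares)
print(coding, round(syn_to_nonsyn, 2))
```

The five categories sum to 113, and the synonymous to non-synonymous ratio of 33/16 is the value rounded to 2:1 in the text.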

Non-synonymous SNPs in coding regions represent alternative allelic variants of a gene, which can 
determine functional changes in the corresponding proteins and lead to phenotypic changes. The 
genes carrying such variants include a retinol dehydrogenase (SmaSNP_264), three zona pellucida proteins 
(SmaSNP_212, SmaSNP_217, SmaSNP_282) related to reproduction processes, and a lipocalin 
(SmaSNP_325) involved in tear secretion (Table 2). 

In the present study, we used sequences obtained from two transcriptome 454-pyrosequencing runs, 
one related to the immune system [10] and another from the hypothalamic-pituitary-gonad axis [11]. 
Thus, GO terms were mainly related to immune response and reproduction processes (Table 2). The 
non-synonymous variation was associated with genes involved in either immune response or sex 
differentiation processes. A large number of SNPs linked to annotated genes were identified and validated. 
This set of markers is being used for population genomic studies and turbot genetic map enrichment, 
both approaches providing useful information for evolutionary and turbot industry applied studies. 

Table 2. Predicted position, SNP location within genes and their corresponding 
synonymous vs. non-synonymous variants of the 113 technically feasible SNPs. 

SNP Name | SNP location/effect | GO term
SmaSNP_211 | 3' UTR | phosphorylation (GO:0016310)
SmaSNP_212 | Non-synonymous | reproduction (GO:0000003)
SmaSNP_215 | Synonymous | mitotic cell cycle (GO:0000278)
SmaSNP_216 | 3' UTR | protein localization to cell division site (GO:0072741)
SmaSNP_217 | Non-synonymous | binding of sperm to zona pellucida (GO:0007339)
SmaSNP_218 | Non-synonymous | protein import into mitochondrial matrix (GO:0030150)
SmaSNP_219 | 3' UTR | ribonucleoprotein complex biogenesis (GO:0022613)
SmaSNP_220 | Synonymous | ribosomal large subunit assembly (GO:0000027)
SmaSNP_222 | 5' UTR | regulation of peptidoglycan recognition protein signaling pathway (GO:0061058)
SmaSNP_223 | Synonymous | cell adhesion (GO:0007155)
SmaSNP_224 | Synonymous | DNA-dependent transcription, initiation (GO:0006352)
SmaSNP_225 | 3' UTR | ribosomal large subunit assembly (GO:0000027)
SmaSNP_226 | Synonymous | cellular alcohol metabolic process (GO:0044107)
SmaSNP_227 | Synonymous | thioredoxin biosynthetic process (GO:0042964)
SmaSNP_228 | 5' UTR | regulation of nucleotide-binding oligomerization domain containing signaling pathway (GO:0070424)
SmaSNP_229 | 3' UTR | immune response to tumor cell (GO:0002418)
SmaSNP_230 | 3' UTR | reproduction (GO:0000003)
SmaSNP_231 | Non-synonymous | phosphorylation of RNA polymerase II C-terminal domain (GO:0070816)
SmaSNP_232 | Synonymous | resolution of meiotic recombination intermediates (GO:0000712)
SmaSNP_233 | Synonymous | ubiquitin-dependent protein catabolic process (GO:0006511)
SmaSNP_234 | 3' UTR | regulation of macrophage inflammatory protein 1 alpha production (GO:0071640)
SmaSNP_235 | Synonymous | protein localization to nucleolar rDNA repeats (GO:0034503)
SmaSNP_236 | Synonymous | T-helper 1 cell activation (GO:0035711)
SmaSNP_237 | 5' UTR | termination of G-protein coupled receptor signaling pathway (GO:0038032)
SmaSNP_238 | Non-synonymous | transcription initiation from RNA polymerase III type 2 promoter (GO:0001023)
SmaSNP_239 | 5' UTR | Not found
SmaSNP_240 | Synonymous | MHC class I protein complex assembly (GO:0002397)
SmaSNP_241 | 3' UTR | reproduction (GO:0000003)
SmaSNP_243 | Non-synonymous | neuronal stem cell maintenance (GO:0097150)
SmaSNP_244 | Unknown | Not found
SmaSNP_245 | 5' UTR | reproduction (GO:0000003)
SmaSNP_246 | Synonymous | intracellular protein transmembrane transport (GO:0065002)
SmaSNP_247 | Non-synonymous | ribosomal protein import into nucleus (GO:0006610)
SmaSNP_248 | 3' UTR | regulation of mitotic recombination (GO:0000019)
SmaSNP_249 | 3' UTR | alpha-L-fucosidase activity (GO:0004560)
SmaSNP_250 | 3' UTR | modulation by virus of host protein serine/threonine phosphatase activity (GO:0039517)
SmaSNP_252 | 3' UTR | regulation of macrophage inflammatory protein 1 alpha production (GO:0071640)
SmaSNP_253 | Synonymous | regulation of cytokinesis (GO:0032465)
SmaSNP_256 | Synonymous | regulation of ribonuclease activity (GO:0060700)
SmaSNP_257 | Synonymous | centromere complex assembly (GO:0034508)
SmaSNP_259 | 3' UTR | complement activation, lectin pathway (GO:0001867)
SmaSNP_260 | 3' UTR | amitosis (GO:0051337)
SmaSNP_261 | 3' UTR | protein processing (GO:0016485)
SmaSNP_262 | Synonymous | RNA polymerase I transcriptional preinitiation complex assembly (GO:0001188)
SmaSNP_263 | Synonymous | membrane protein proteolysis (GO:0033619)
SmaSNP_264 | Non-synonymous | reproduction (GO:0000003)
SmaSNP_265 | 5' UTR | ribosomal subunit export from nucleus (GO:0000054)
SmaSNP_266 | Synonymous | reproduction (GO:0000003)
SmaSNP_267 | 3' UTR | smoothened signaling pathway involved in regulation of cerebellar granule cell precursor cell proliferation (GO:0021938)
SmaSNP_270 | 3' UTR | flotillin complex (GO:0016600)
SmaSNP_271 | Synonymous | NAD(P)H dehydrogenase complex assembly (GO:0010275)
SmaSNP_273 | 3' UTR | regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle (GO:0051439)
SmaSNP_274 | Unknown | reproduction (GO:0000003)
SmaSNP_275 | Synonymous | dolichyl-phosphate beta-D-mannosyltransferase activity (GO:0004582)
SmaSNP_276 | Synonymous | transcription from RNA polymerase III type 2 promoter (GO:0001009)
SmaSNP_277 | 3' UTR | T-helper 2 cell activation (GO:0035712)
SmaSNP_278 | 3' UTR | cellular response to methionine (GO:0061431)
SmaSNP_279 | Non-synonymous | protein import (GO:0017038)
SmaSNP_281 | 5' UTR | regulation of beta 2 integrin biosynthetic process (GO:0045115)
SmaSNP_282 | Non-synonymous | regulation of binding of sperm to zona pellucida (GO:2000359)
SmaSNP_283 | 3' UTR | cellular macromolecular complex subunit organization (GO:0034621)
SmaSNP_284 | Non-synonymous | regulation of apoptotic process (GO:0042981)
SmaSNP_285 | Synonymous | chromosome passenger complex localization to kinetochore (GO:0072356)
SmaSNP_286 | 5' UTR | brain development (GO:0007420)
SmaSNP_287 | Non-synonymous | extracellular vesicular exosome assembly (GO:0071971)
SmaSNP_288 | Unknown | Not found
SmaSNP_289 | Unknown | Not found
SmaSNP_290 | Synonymous | regulation of ubiquitin-specific protease activity (GO:2000152)
SmaSNP_292 | 3' UTR | suppression by virus of host TAP complex (GO:0039589)
SmaSNP_293 | 3' UTR | DNA replication preinitiation complex assembly (GO:0071163)
SmaSNP_296 | Synonymous | carbon utilization (GO:0015976)
SmaSNP_297 | 3' UTR | membrane protein proteolysis (GO:0033619)
SmaSNP_298 | Non-synonymous | queuine tRNA-ribosyltransferase activity (GO:0008479)
SmaSNP_299 | Synonymous | Not found
SmaSNP_304 | Synonymous | reproduction (GO:0000003)
SmaSNP_305 | 3' UTR | protein-DNA complex subunit organization (GO:0071824)
SmaSNP_306 | 3' UTR | behavioral response to stimulus (GO:0007610)
SmaSNP_307 | Synonymous | reproduction (GO:0000003)
SmaSNP_308 | Non-synonymous | Not found
SmaSNP_309 | Synonymous | testosterone secretion (GO:0035936)
SmaSNP_310 | 3' UTR | dynein-driven meiotic oscillatory nuclear movement (GO:0030989)
SmaSNP_311 | 3' UTR | suppression by virus of host tapasin activity (GO:0039591)
SmaSNP_312 | 5' UTR | Not found
SmaSNP_314 | Synonymous | regulation of macrophage inflammatory protein 1 alpha production (GO:0071640)
SmaSNP_315 | 3' UTR | menopause (GO:0042697)
SmaSNP_316 | 3' UTR | T-helper 1 cell activation (GO:0035711)


SmaSNP_317 


5'UTR 


Not found 


SNP Name 


SNP location/effect 


GO term 


SmaSNP. 


.318 


Unknown 


Not found 


SmaSNP. 


.319 


5' UTR 


phosphorylation (GO;0016310) 


SmaSNP. 


.320 


Unknown 


Not found 


SmaSNP. 


.321 


3'UTR 


Golgi to plasma membrane protein transport (GO:0043001) 


SmaSNP 


322 


3'UTR 


regulation of ATP citrate synthase activity (GO:2000983) 


SmaSNP 


323 


3'UTR 


regulation of G-protein beta subunit-mediated signal transduction in response to host (GO:0075162) 


SmaSNP. 


.324 


Non synonymous 


cytokinesis, actomyosin contractile ring assembly (GO:0000915) 


SmaSNP. 


.325 


Non synonymous 


tear secretion (GO:0070075) 


SmaSNP. 


.326 


5'UTR 


Not found 


SmaSNP. 


.328 


Unknown 


Not found 


SmaSNP 


329 


5' UTR 


regulation of type II hypersensitivity (GO:0002892) 


SmaSNP 


330 


Unknown 


Not found 


SmaSNP. 


.332 


5'UTR 


phosphorylation (GO;0016310) 


SmaSNP. 


.333 


5' UTR 


Not found 


SmaSNP. 


.334 


5'UTR 


Not found 


SmaSNP. 


.337 


3' UTR 


juvemle-hormone esterase activity (GO:0004453) 


SmaSNP 


338 


3'UTR 


inflammatory response to antigenic stimulus (GO:0002437) 


SmaSNP 


339 


Synonymous 


T-helper 1 cell activation (GO:0035711) 


SmaSNP. 


.340 


Synonymous 


regulation of G-protein coupled receptor protein signaling pathway (GO:0008277) 


SmaSNP. 


.341 


Synonymous 


sperm entry (GO:0035037) 


SmaSNP. 


.342 


Unknown 


Not found 


SmaSNP. 


.343 


Synonymous 


collagen metabolic process (GO:0032963) 


SmaSNP 


346 


3'UTR 


chromatin silencing at silent mating-type cassette (GO:0030466) 


SmaSNP 


347 


3'UTR 


nucleoside oxidase activity (GO:0033715) 


SmaSNP. 


.348 


5'UTR 


retinoic acid receptor signaling pathway (GO:0048384) 


SmaSNP. 


.349 


3'UTR 


T-helper 1 cell activation (GO:0035711) 



3. Experimental Section 

3.1. EST Database, SNP Detection and Annotation 



Sequences were obtained from two transcriptome 454-pyrosequencing runs of turbot cDNA
libraries, one from the immune transcriptome [10] and the other from the
hypothalamic-pituitary-gonad axis [11]. A brief description of both runs is shown in Table 3.
All the 454-reads were assembled with MIRA [48], and they make up the 454-sequences incorporated
into the turbot database. In order to create contigs and detect SNPs, these 454-sequences were
assembled together with the Sanger sequences available in the database [9] with CAP3 [49] using
default parameters. This is a common strategy when dealing with hybrid Sanger-454 assemblies [50].
The resulting ACE format assembly file was fed into QualitySNP [51], following the bioinformatic
pipeline described by Vera et al. [29]. Briefly, QualitySNP uses three filters for the
identification of reliable SNPs: filter 1 screens for all potential SNPs with the requirement
that every allele is represented in more than one sequence (each contig must have a depth of at
least 4 sequences); filter 2 uses a haplotype-based strategy to detect reliable SNPs after
reconstructing confident haplotypes with an algorithm that minimizes false haplotypes due to
sequencing errors; and filter 3 screens SNPs by calculating a confidence score based on sequence
redundancy and quality (only sequences with PHRED >20 were used). SNPs that pass filters 1 and 2
are called real SNPs and those passing all three filters are called true SNPs [51]. The resulting
files were processed with our own custom Perl programs to extract relevant information, and the
data obtained were imported into a MySQL server [52]. A user-friendly web interface was designed
in which contig graphs are clickable and the output is visually refined with color-coded
nucleotide views [53]. A graphical interface allowing the SNP database to be searched by alleles,
contig depth, and annotation was also set up. EST annotation of these contigs was performed using
BLASTx, which searches protein databases using a translated nucleotide query [54]. Only
E-values lower than 10"^ were considered for gene annotation (Table 1, Table SI). 
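
The first filter can be illustrated with a minimal sketch. This is not the QualitySNP code: the function name `candidate_snps`, the toy reads, and the representation of a contig as equal-length aligned strings (with `-` where a read does not cover a column) are assumptions made for illustration only. The two thresholds come from the description above: a contig depth of at least 4 sequences, and every allele seen in more than one sequence.

```python
from collections import Counter

MIN_DEPTH = 4          # each contig needs a depth of at least 4 sequences
MIN_ALLELE_READS = 2   # every allele must be seen in more than one sequence


def candidate_snps(aligned_reads):
    """Return [(position, {allele: count})] for columns passing a filter-1-like rule.

    aligned_reads: equal-length strings over A/C/G/T, with '-' where a read
    does not cover the column (a simplification made for this sketch).
    """
    if len(aligned_reads) < MIN_DEPTH:
        return []
    hits = []
    for pos in range(len(aligned_reads[0])):
        counts = Counter(r[pos] for r in aligned_reads if r[pos] in "ACGT")
        supported = {a: n for a, n in counts.items() if n >= MIN_ALLELE_READS}
        # Keep the column only if it has two or more alleles and none is a singleton.
        if len(supported) >= 2 and len(supported) == len(counts):
            hits.append((pos, supported))
    return hits


# Hypothetical aligned reads: column 2 has G x3 and C x2; column 7 has a singleton A.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACGT",
    "ACCTACGA",
    "ACGTACGT",
]
print(candidate_snps(reads))
```

In this toy contig only column 2 is retained; the singleton `A` in the last column is treated as a likely sequencing error. Filters 2 and 3 (haplotype reconstruction and the confidence score) are not sketched here.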

Table 3. Description of two transcriptome 454-pyrosequencing runs of turbot.

                            Immune ¹               Hypothalamic-pituitary-gonad axis ²
Samples
  Number of individuals     52                     30
  Origin                    Commercial fish farm   Commercial fish farm
Data
  Number of reads           915,782                1,191,866
  Total megabases (Mb)      291.04                 341.20
  Average read length       317.8                  286.0
Assembly
  Number of contigs         55,504                 65,472
  Mean length (bp)          671.3                  625.9
  Average contig coverage   4.4                    4.6

¹ From Pereiro et al. [10]; ² From Rivas et al. [11].

3.2. SNP Genotyping and Validation 

DNA of all individuals analyzed was extracted from a piece of caudal fin using standard 
phenol-chloroform procedures [55]. 

SNPs identified were validated and genotyped with the MassARRAY platform (Sequenom, 
San Diego, CA, USA) following the protocols and recommendations provided by the manufacturer. 
Briefly, the technique consists of an initial locus-specific polymerase chain reaction (PCR), followed 
by single-base extension using mass-modified dideoxynucleotide terminators of an oligonucleotide 
primer that anneals immediately upstream of the polymorphic site (SNP) of interest (see [56,57] for 
more technical information). The distinct mass of the extended primer identifies the SNP allele. Primer 
sequences, SNP positions, expected variants and annotation for the 140 tested SNPs are shown in
Supplementary Table 1. Alleles were scored by MALDI-TOF mass spectrometry on an Autoflex
spectrometer.

Assays were designed for 140 true SNPs, each located in a different contig, and were combined
in 7 multiplex reactions of 24 SNPs each, except for multiplexes 5 (23 SNPs), 6 (18 SNPs) and
7 (3 SNPs) (see Supplementary Table 1 for multiplex information). SNP multiplexes were designed
in silico and tested on a panel of 8 turbot individuals from a wild Cantabrian (northern Spain)
population. Based on manual inspection, SNPs were classified as "failed assays" (when the
majority of genotypes could not be scored and/or the samples did not cluster well according to
genotype) or "feasible SNPs" (markers with proper and reliable genotypes), the latter being either
monomorphic or polymorphic.

3.3. Gene Diversity and Population Analysis 

In order to estimate genetic diversity parameters, all SNPs were genotyped in a sample of
33 individuals (including the 8 individuals used to test marker performance) from the same wild
Cantabrian population.

Genetic diversity, measured as unbiased expected heterozygosity (He) and minimum allele
frequency (MAF), was estimated using FSTAT 2.9.3 [58]. Conformance to Hardy-Weinberg (HW) and
genotypic equilibria was tested using GENEPOP 4.0 [59,60]. Conformance to HW equilibrium (HWE)
was checked using the complete enumeration method [61] because only two alleles were detected at
each locus. Bonferroni correction was applied when multiple tests were performed [62].
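
For a biallelic locus these quantities are simple to compute by hand, and the following sketch shows one way to do so. It is not the FSTAT or GENEPOP implementation, and their estimators may differ in detail: He is computed here with Nei's (1978) small-sample correction, and the exact test sums the probabilities of all heterozygote counts that are no more likely than the observed one. The function names, the genotype counts in the example, the nominal 0.05 level and the number of tests (111 polymorphic loci) are assumptions made for illustration.

```python
from math import exp, lgamma, log


def allele_stats(n_aa, n_ab, n_bb):
    """Return (MAF, He) from genotype counts at a biallelic locus.

    He uses Nei's (1978) correction: 2n / (2n - 1) * (1 - sum of p_i^2).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    he = (2 * n / (2 * n - 1)) * (1.0 - p * p - q * q)
    return min(p, q), he


def _log_prob(n_ab, n, n_a):
    """log P(n_ab heterozygotes | n individuals, n_a copies of allele A) under HWE."""
    n_b = 2 * n - n_a
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    return (lgamma(n + 1) - lgamma(n_aa + 1) - lgamma(n_ab + 1) - lgamma(n_bb + 1)
            + n_ab * log(2.0)
            + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE P-value by complete enumeration of heterozygote counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n - n_a
    observed = _log_prob(n_ab, n, n_a)
    p_value = 0.0
    # Heterozygote counts must have the same parity as the allele count.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = _log_prob(het, n, n_a)
        if lp <= observed + 1e-9:
            p_value += exp(lp)
    return min(p_value, 1.0)


# Hypothetical genotype counts for one SNP in 33 individuals.
maf, he = allele_stats(20, 11, 2)
p_hwe = hwe_exact_p(20, 11, 2)

# Bonferroni: divide the nominal level by the number of tests performed.
n_tests = 111
alpha_adjusted = 0.05 / n_tests
print(f"MAF={maf:.3f} He={he:.3f} P(HWE)={p_hwe:.3f} significant={p_hwe < alpha_adjusted}")
```

Complete enumeration is practical here because, with two alleles and a fixed allele count, the only free quantity is the number of heterozygotes, so the full set of possible samples is small.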

3.4. Detection of Synonymous/Non-Synonymous SNPs 

All six possible reading frames of the consensus sequence of each functionally annotated,
SNP-containing contig were obtained using the ORF (Open Reading Frame) Finder application [63].
The best candidate frame (usually the longest one) was compared against the NCBI protein
database using BLASTp and BLASTx, and the best-matching protein (lowest E-value) was downloaded
and aligned with the selected frame using ClustalW [64] as implemented in BioEdit v. 7.1 [65].
This approach enabled us to determine whether each SNP lay within the coding region. For SNPs in
the coding region, both allelic variants were translated and the resulting amino acid sequences
compared to determine whether the change was synonymous or non-synonymous. Gene Ontology (GO)
terms were searched using the QuickGO [66] and AmiGO [67] utilities.
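
The final classification step can be sketched as follows, assuming the reading frame has already been established by the alignment described above. The function name `classify_snp` and the example sequence are hypothetical, and the standard genetic code (NCBI translation table 1) is assumed.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_snp(cds, snp_index, alt_base):
    """Classify a coding SNP as synonymous or non-synonymous.

    cds: coding sequence in frame (index 0 is the first base of a codon).
    snp_index: 0-based position of the SNP within cds.
    alt_base: the alternative allele at that position.
    Returns (effect, reference amino acid, alternative amino acid).
    """
    start = snp_index - snp_index % 3
    ref_codon = cds[start:start + 3].upper()
    offset = snp_index - start
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    effect = "Synonymous" if ref_aa == alt_aa else "Non synonymous"
    return effect, ref_aa, alt_aa


# Hypothetical in-frame fragment: ATG GCT GAA encodes Met-Ala-Glu.
print(classify_snp("ATGGCTGAA", 5, "C"))  # GCT -> GCC, both alanine
print(classify_snp("ATGGCTGAA", 6, "C"))  # GAA -> CAA, Glu -> Gln
```

SNPs that fall outside the aligned coding frame would be assigned to the 5' UTR or 3' UTR, or left as unknown when no reliable frame is available, as in Table 2.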

4. Conclusions 

A total of 140 contigs (total length 262,093 bp) formed exclusively by 454-pyrosequencing reads 
were used to identify new putative SNPs in S. maximus. One hundred and thirteen SNPs of the 140 
tested were amplified and genotyped, 111 being polymorphic in a wild Cantabrian population, 
showing the utility of the new NGS techniques for true SNP detection (conversion rate = 79.3%). 
Diversity levels in the population were similar to those of previous studies [29,42,44] and
genotype frequencies were in accordance with HWE expectations. A substantial number of these
polymorphic SNPs were located in the coding region, and 16 of them (14.4%) represented
non-synonymous changes in genes related to immune response and gonad differentiation processes,
as shown by the detected GO terms. These SNPs are therefore valuable resources for future
population genetics studies, high-resolution genetic mapping, quantitative trait loci (QTL)
identification, association studies and marker-assisted selection (MAS) breeding in turbot.
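
The percentages quoted in this section follow directly from the counts reported above, as the short check below shows. The variable names are illustrative; the counts are those given in the text.

```python
tested = 140          # true SNPs for which assays were designed
feasible = 113        # amplified and genotyped
polymorphic = 111     # polymorphic in the wild Cantabrian population
non_synonymous = 16   # non-synonymous changes among the polymorphic SNPs


def pct(part, whole):
    """Percentage of whole represented by part, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(pct(polymorphic, tested))          # conversion rate: 79.3
print(pct(non_synonymous, polymorphic))  # non-synonymous share: 14.4
print(pct(feasible, tested))             # feasible assays: 80.7
```

The conversion rate is therefore expressed relative to all 140 assays designed, and the non-synonymous share relative to the 111 polymorphic SNPs.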